fix: clear stale pod-name annotation instead of hard error #521
noeljackson wants to merge 1 commit into kubernetes-sigs:main
When the pod tracked by agents.x-k8s.io/pod-name doesn't exist (deleted during warm pool rotation, eviction, or image pull failure), the controller returned a hard error, leaving the Sandbox stuck in a reconcile loop unable to create a replacement pod. Now the controller clears the stale annotation and falls through to pod creation. The new pod gets tracked via ensurePodNameAnnotation.
/ok-to-test
codebot-robot left a comment
Overall, this PR provides an excellent and solid fix for resolving the nasty edge case of a stale pod annotation loop that blocks Sandbox reconciliation when a tracked backing Pod is deleted out-of-band (e.g., evicted or rotated from the warm pool). The approach of proactively clearing the stale annotation and falling back to standard pod creation is safe, reliable, and aligns well with standard Kubernetes controller patterns.
The PR handles object mutation correctly by deep-copying the cached object before modifying it, so that the merge patch is computed against an unmodified base, and the tests comprehensively verify both the new Pod creation and the final updated state of the Sandbox's annotations.
I've left a few minor inline comments pointing out some good practices used here, validating the architectural approach with the explicit patch, and suggesting enhancements for observability, stronger error context, and expanded test coverage (specifically ensuring that client.MergeFrom correctly preserves unrelated annotations). No blocking issues found. Great work on this fix!
(This review was generated by Overseer)
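The review's point about unrelated annotations follows from JSON merge patch semantics (RFC 7386): a patch names only the keys that changed, and a removed key is sent as null. The following is a minimal stdlib-only sketch of that behavior, not controller-runtime's implementation; annotationMergePatch, applyAnnotationPatch, and the example.com/owner annotation are hypothetical names introduced for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// annotationMergePatch is a hypothetical helper that builds a JSON merge
// patch (RFC 7386 style) for metadata.annotations: added or changed keys
// carry their new value, removed keys carry null, untouched keys are omitted.
func annotationMergePatch(before, after map[string]string) ([]byte, error) {
	diff := map[string]*string{}
	for k, v := range after {
		if old, ok := before[k]; !ok || old != v {
			val := v
			diff[k] = &val
		}
	}
	for k := range before {
		if _, ok := after[k]; !ok {
			diff[k] = nil // marshals to null, meaning "delete this key"
		}
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{"annotations": diff},
	}
	return json.Marshal(patch)
}

// applyAnnotationPatch applies such a patch to a copy of the live annotations,
// standing in for what the API server would do with the stored object.
func applyAnnotationPatch(live map[string]string, patch []byte) (map[string]string, error) {
	var p struct {
		Metadata struct {
			Annotations map[string]*string `json:"annotations"`
		} `json:"metadata"`
	}
	if err := json.Unmarshal(patch, &p); err != nil {
		return nil, err
	}
	out := make(map[string]string, len(live))
	for k, v := range live {
		out[k] = v
	}
	for k, v := range p.Metadata.Annotations {
		if v == nil {
			delete(out, k)
		} else {
			out[k] = *v
		}
	}
	return out, nil
}

func main() {
	before := map[string]string{
		"agents.x-k8s.io/pod-name": "warm-pool-abc",
		"example.com/owner":        "team-a", // unrelated annotation (assumed)
	}
	after := map[string]string{"example.com/owner": "team-a"}

	patch, err := annotationMergePatch(before, after)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))

	result, err := applyAnnotationPatch(before, patch)
	if err != nil {
		panic(err)
	}
	fmt.Println(result)
}
```

Because the unrelated key never appears in the patch, applying it cannot overwrite or drop that key, which is the property the suggested extra test would pin down.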
Cherry-picks two upstream fixes:
1. kubernetes-sigs#521: When an adopted warm pool pod is deleted (node failure, drain, eviction), the controller returned a hard error because the agents.x-k8s.io/pod-name annotation pointed to a non-existent pod. This left the Sandbox stuck in a permanent reconcile error loop. Now the controller clears the stale annotation and falls through to create a replacement pod (which remounts the existing PVC).
2. kubernetes-sigs#469: During warm pool adoption, ensure the pod-name annotation is correct before the sandbox can be observed as Ready. Prevents stale annotations from being set in the first place.
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Summary
When the pod tracked by the agents.x-k8s.io/pod-name annotation doesn't exist, clear the stale annotation and fall through to pod creation instead of returning a hard error.
Problem
The ensurePodNameAnnotation function (commit 32cddd3) records the backing pod's name on the Sandbox CR. This is used for stable pod tracking across reconciliations. However, when the annotated pod is deleted (warm pool rotation, eviction, image pull failure), reconcilePod returns a hard error.
The controller never reaches PATH 3 (create pod). The Sandbox is stuck in a reconcile error loop and the warm pool never becomes ready.
Fix
When the annotated pod isn't found, clear the stale annotation and let pod = nil fall through to pod creation. The subsequent ensurePodNameAnnotation call after pod creation re-sets the annotation to track the new pod.
Test plan
TestReconcilePodClearsStaleAnnotation: a sandbox whose annotation points to a non-existent pod creates a new pod and updates the annotation to track it
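The scenario this test exercises can be modeled without a cluster. The sketch below is a simplified, stdlib-only stand-in and not the controller's actual code: the Sandbox and podStore types, the errNotFound sentinel, and the choice to name the replacement pod after the sandbox are all assumptions made for illustration. Only the annotation key and the function names reconcilePod and ensurePodNameAnnotation come from the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

const podNameAnnotation = "agents.x-k8s.io/pod-name"

// errNotFound stands in for an API "not found" error (assumption).
var errNotFound = errors.New("pod not found")

// Sandbox is a simplified stand-in for the Sandbox CR.
type Sandbox struct {
	Name        string
	Annotations map[string]string
}

// podStore is a fake in-memory set of existing pod names.
type podStore map[string]bool

func (s podStore) get(name string) error {
	if !s[name] {
		return errNotFound
	}
	return nil
}

// ensurePodNameAnnotation records the backing pod's name on the sandbox.
func ensurePodNameAnnotation(sb *Sandbox, podName string) {
	if sb.Annotations == nil {
		sb.Annotations = map[string]string{}
	}
	sb.Annotations[podNameAnnotation] = podName
}

// reconcilePod returns the name of the pod backing the sandbox, creating a
// replacement when the tracked pod no longer exists.
func reconcilePod(sb *Sandbox, pods podStore) (string, error) {
	podName := ""
	if tracked, ok := sb.Annotations[podNameAnnotation]; ok {
		err := pods.get(tracked)
		switch {
		case err == nil:
			podName = tracked
		case errors.Is(err, errNotFound):
			// Stale annotation: clear it and fall through to creation
			// instead of returning an error.
			delete(sb.Annotations, podNameAnnotation)
		default:
			return "", fmt.Errorf("getting tracked pod %q: %w", tracked, err)
		}
	}
	if podName == "" {
		podName = sb.Name // hypothetical naming scheme for the new pod
		pods[podName] = true
	}
	ensurePodNameAnnotation(sb, podName)
	return podName, nil
}

func main() {
	sb := &Sandbox{
		Name:        "sb-1",
		Annotations: map[string]string{podNameAnnotation: "warm-pool-abc"},
	}
	pods := podStore{} // the annotated pod has been deleted

	name, err := reconcilePod(sb, pods)
	fmt.Println(name, err, sb.Annotations[podNameAnnotation])
}
```

Only the not-found case is treated as recoverable here; any other lookup error is still returned with context, so transient API failures would not trigger a spurious pod replacement.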